{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Time series interpolating with sktime"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Suppose we have a set of time series with different lengths, i.e. different number of time points. Currently, most of sktime's functionality requires equal-length time series, so to use sktime, we need to first converted our data into equal-length time series. In this tutorial, you will learn how to use the `TSInterpolator` to do so. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-12-19T14:31:58.456171Z",
     "iopub.status.busy": "2020-12-19T14:31:58.455565Z",
     "iopub.status.idle": "2020-12-19T14:31:59.189497Z",
     "shell.execute_reply": "2020-12-19T14:31:59.190005Z"
    }
   },
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.pipeline import Pipeline\n",
    "\n",
    "from sktime.classification.interval_based import TimeSeriesForestClassifier\n",
    "from sktime.datasets import load_basic_motions\n",
    "from sktime.transformations.panel.compose import ColumnConcatenator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ordinary situation\n",
    "\n",
    "Here is a normal situation, when all time series have same length. We load a toy data set from sktime and train a classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-12-19T14:31:59.194445Z",
     "iopub.status.busy": "2020-12-19T14:31:59.193903Z",
     "iopub.status.idle": "2020-12-19T14:32:01.019896Z",
     "shell.execute_reply": "2020-12-19T14:32:01.020463Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X, y = load_basic_motions(return_X_y=True)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
    "\n",
    "steps = [\n",
    "    (\"concatenate\", ColumnConcatenator()),\n",
    "    (\"classify\", TimeSeriesForestClassifier(n_estimators=100)),\n",
    "]\n",
    "clf = Pipeline(steps)\n",
    "clf.fit(X_train, y_train)\n",
    "clf.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## If time series are unequal length, sktime's algorithm may raise an error\n",
    "\n",
    "Now we are going to spoil the data set a little bit by randomly cutting the time series. This leads to unequal-length time series. Consequently, we have an error while attempt to train a classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-12-19T14:32:01.026183Z",
     "iopub.status.busy": "2020-12-19T14:32:01.025650Z",
     "iopub.status.idle": "2020-12-19T14:32:01.239714Z",
     "shell.execute_reply": "2020-12-19T14:32:01.240542Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "IndexError: Tabularization failed, it's possible that not all series were of equal length\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/mloning/.conda/envs/sktime-dev/lib/python3.7/site-packages/numpy/core/_asarray.py:136: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray\n",
      "  return array(a, dtype, copy=False, order=order, subok=True)\n"
     ]
    }
   ],
   "source": [
    "# randomly cut the data series in-place\n",
    "def random_cut(df):\n",
    "    for row_i in range(df.shape[0]):\n",
    "        for dim_i in range(df.shape[1]):\n",
    "            ts = df.iloc[row_i][f\"dim_{dim_i}\"]\n",
    "            df.iloc[row_i][f\"dim_{dim_i}\"] = pd.Series(\n",
    "                ts.tolist()[: random.randint(len(ts) - 5, len(ts) - 3)]\n",
    "            )  # here is a problem\n",
    "\n",
    "\n",
    "X, y = load_basic_motions(return_X_y=True)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
    "\n",
    "for df in [X_train, X_test]:\n",
    "    random_cut(df)\n",
    "\n",
    "try:\n",
    "    steps = [\n",
    "        (\"concatenate\", ColumnConcatenator()),\n",
    "        (\"classify\", TimeSeriesForestClassifier(n_estimators=100)),\n",
    "    ]\n",
    "    clf = Pipeline(steps)\n",
    "    clf.fit(X_train, y_train)\n",
    "    clf.score(X_test, y_test)\n",
    "except ValueError as e:\n",
    "    print(f\"IndexError: {e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Now the interpolator enters\n",
    "Now we use our interpolator to resize time series of different lengths to user-defined length. Internally, it uses linear interpolation from scipy and draws equidistant samples on the user-defined number of points. \n",
    "\n",
    "After interpolating the data, the classifier works again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2020-12-19T14:32:01.245270Z",
     "iopub.status.busy": "2020-12-19T14:32:01.244733Z",
     "iopub.status.idle": "2020-12-19T14:32:02.911970Z",
     "shell.execute_reply": "2020-12-19T14:32:02.912833Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sktime.transformations.panel.interpolate import TSInterpolator\n",
    "\n",
    "X, y = load_basic_motions(return_X_y=True)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
    "\n",
    "for df in [X_train, X_test]:\n",
    "    random_cut(df)\n",
    "\n",
    "steps = [\n",
    "    (\"transform\", TSInterpolator(50)),\n",
    "    (\"concatenate\", ColumnConcatenator()),\n",
    "    (\"classify\", TimeSeriesForestClassifier(n_estimators=100)),\n",
    "]\n",
    "clf = Pipeline(steps)\n",
    "clf.fit(X_train, y_train)\n",
    "clf.score(X_test, y_test)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
